evaluation technique
The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data
Causal inference is central to many areas of artificial intelligence, including complex reasoning, planning, knowledge-base construction, robotics, explanation, and fairness. An active community of researchers develops and enhances algorithms that learn causal models from data, and this work has produced a series of impressive technical advances. However, evaluation techniques for causal modeling algorithms have remained somewhat primitive, limiting what we can learn from experimental studies of algorithm performance, constraining the types of algorithms and model representations that researchers consider, and creating a gap between theory and practice. We argue for more frequent use of evaluation techniques that examine interventional measures rather than structural or observational measures, and that evaluate those measures on empirical data rather than synthetic data. We survey the current practice in evaluation and show that the techniques we recommend are rarely used in practice. We show that such techniques are feasible and that data sets are available to conduct such evaluations. We also show that these techniques produce substantially different results than using structural measures and synthetic data.
The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong
Petelin, Gašper, Cenikj, Gjorgjina
Algorithm selection, aiming to identify the best algorithm for a given problem, plays a pivotal role in continuous black-box optimization. A common approach involves representing optimization functions using a set of features, which are then used to train a machine learning meta-model for selecting suitable algorithms. Various approaches have demonstrated the effectiveness of these algorithm selection meta-models. However, not all evaluation approaches are equally valid for assessing the performance of meta-models. We highlight methodological issues that frequently occur in the community and should be addressed when evaluating algorithm selection approaches. First, we identify flaws with the "leave-instance-out" evaluation technique. We show that non-informative features and meta-models can achieve high accuracy, which should not be the case with a well-designed evaluation framework. Second, we demonstrate that measuring the performance of optimization algorithms with metrics sensitive to the scale of the objective function requires careful consideration of how this impacts the construction of the meta-model, its predictions, and the model's error. Such metrics can falsely present overly optimistic performance assessments of the meta-models. This paper emphasizes the importance of careful evaluation, as loosely defined methodologies can mislead researchers, divert efforts, and introduce noise into the field
The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data
Causal inference is central to many areas of artificial intelligence, including complex reasoning, planning, knowledge-base construction, robotics, explanation, and fairness. An active community of researchers develops and enhances algorithms that learn causal models from data, and this work has produced a series of impressive technical advances. However, evaluation techniques for causal modeling algorithms have remained somewhat primitive, limiting what we can learn from experimental studies of algorithm performance, constraining the types of algorithms and model representations that researchers consider, and creating a gap between theory and practice. We argue for more frequent use of evaluation techniques that examine interventional measures rather than structural or observational measures, and that evaluate those measures on empirical data rather than synthetic data. We survey the current practice in evaluation and show that the techniques we recommend are rarely used in practice. We show that such techniques are feasible and that data sets are available to conduct such evaluations.
Quantifying AI Vulnerabilities: A Synthesis of Complexity, Dynamical Systems, and Game Theory
We propose a novel approach that introduces three metrics: System Complexity Index (SCI), Lyapunov Exponent for AI Stability (LEAIS), and Nash Equilibrium Robustness (NER). SCI quantifies the inherent complexity of an AI system, LEAIS captures its stability and sensitivity to perturbations, and NER evaluates its strategic robustness against adversarial manipulation. Through comparative analysis, we demonstrate the advantages of our framework over existing techniques. We discuss the theoretical and practical implications, potential applications, limitations, and future research directions. Our work contributes to the development of secure and trustworthy AI technologies by providing a holistic, theoretically grounded approach to AI security evaluation. As AI continues to advance, prioritising and advancing AI security through interdisciplinary collaboration is crucial to ensure its responsible deployment for the benefit of society.
From Requirements to Architecture: An AI-Based Journey to Semi-Automatically Generate Software Architectures
Eisenreich, Tobias, Speth, Sandro, Wagner, Stefan
Designing domain models and software architectures represents a significant challenge in software development, as the resulting architectures play a vital role in fulfilling the system's quality of service. Due to time pressure, architects often model only one architecture based on their known limited domain understanding, patterns, and experience instead of thoroughly analyzing the domain and evaluating multiple candidates, selecting the best fitting. Existing approaches try to generate domain models based on requirements, but still require time-consuming manual effort to achieve good results. Therefore, in this vision paper, we propose a method to generate software architecture candidates semi-automatically based on requirements using artificial intelligence techniques. We further envision an automatic evaluation and trade-off analysis of the generated architecture candidates using, e.g., the architecture trade-off analysis method combined with large language models and quantitative analyses. To evaluate this approach, we aim to analyze the quality of the generated architecture models and the efficiency and effectiveness of our proposed process by conducting qualitative studies.
A Comprehensive Survey of Evaluation Techniques for Recommendation Systems
The effectiveness of recommendation systems is pivotal to user engagement and satisfaction in online platforms. As these recommendation systems increasingly influence user choices, their evaluation transcends mere technical performance and becomes central to business success. This paper addresses the multifaceted nature of recommendations system evaluation by introducing a comprehensive suite of metrics, each tailored to capture a distinct aspect of system performance. We discuss * Similarity Metrics: to quantify the precision of content-based filtering mechanisms and assess the accuracy of collaborative filtering techniques. * Candidate Generation Metrics: to evaluate how effectively the system identifies a broad yet relevant range of items. * Predictive Metrics: to assess the accuracy of forecasted user preferences. * Ranking Metrics: to evaluate the effectiveness of the order in which recommendations are presented. * Business Metrics: to align the performance of the recommendation system with economic objectives. Our approach emphasizes the contextual application of these metrics and their interdependencies. In this paper, we identify the strengths and limitations of current evaluation practices and highlight the nuanced trade-offs that emerge when optimizing recommendation systems across different metrics. The paper concludes by proposing a framework for selecting and interpreting these metrics to not only improve system performance but also to advance business goals. This work is to aid researchers and practitioners in critically assessing recommendation systems and fosters the development of more nuanced, effective, and economically viable personalization strategies. Our code is available at GitHub - https://github.com/aryan-jadon/Evaluation-Metrics-for-Recommendation-Systems.
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents
Smith, Eric Michael, Hsu, Orion, Qian, Rebecca, Roller, Stephen, Boureau, Y-Lan, Weston, Jason
At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice of when to use which one, and possible future directions.
An effective hybrid search algorithm for the multiple traveling repairman problem with profits
Ren, Jintong, Hao, Jin-Kao, Wu, Feng, Fu, Zhang-Hua
As an extension of the traveling repairman problem with profits, the multiple traveling repairman problem with profits consists of multiple repairmen who visit a subset of all customers to maximize the revenues collected through the visited customers. To solve this challenging problem, an effective hybrid search algorithm based on the memetic algorithm framework is proposed. It integrates two distinguished features: a dedicated arc-based crossover to generate high-quality offspring solutions and a fast evaluation technique to reduce the complexity of exploring the classical neighborhoods. We show the competitiveness of the algorithm on 470 benchmark instances compared to the leading reference algorithms and report new best records for 137 instances as well as equal best results for other 330 instances. We investigate the importance of the key search components for the algorithm.